My goal with this project was to attempt to answer the age old sports question of “So, Who is going to win this game?”
I wanted to see if I could use historic information about a teams offensive and defensive performances to garner an accurate prediction on whether or not they will win there upcoming game.
All data ingestion, processing, analysis, and modeling was done in R
and I leaned heavily on the tidyverse and
tidymodels ecosystems to get this done.
I pulled NFL play-by-play data together using the
nflFastR R package. I then did some data cleaning and
wrangling tasks to get the data into a usable format.
I started by getting the full play-by-play datasets from 1999-2021. Each year there are some ~50,000 plays run, and for each play, there are 372 columns of data giving information about what happened on that play.
I created a handful of helper functions that take the
nflFast R play-by-play data as an input and then aggregates
the data to game level data.
In the end I used data from the 1999-2021 seasons, however I completed omitted the entire 2021 season to use as a holdout set of data to test our models against.
Let’s dig into some of the data and see what we can learn.
Below is a box plot showing the season win totals for each NFL team over the period of record, arranged by mean annual win total. It is not surprising to see the New England Patriots leading the league with an average of 12.6 wins per season. The Patriots are followed by the Steelers, Packers, Colts, Ravens, all teams that have seen consistent success over the last 23 years. At the bottom of the league are the Browns, Lions, Jaguars, and the Raiders.
Now how about which team’s quarterbacks have the highest EPA. EPA is a metric that determines how likely a team is to score points as a result of a play. If a QB does something good (long completion, run for a first down), then there team is more likely to score points on the ensuing play, thus Expected points were added (EPA is positive). On the other hand if a QB does something bad (incompletion, takes a big sack), then there team is less likely to score points on the ensuing play and expected points were removed (EPA is negative)
Below is a plot showing the for each team, the average QB EPA across the all seasons. It is not suprising to see teams like the Patriots, Packers, and Colts at the top of the league in terms of average QB EPA seeing as though these teams had Hall of Fame Quarterbacks for most of the period of time, Tom Brady, Brett Favre/Aaron Rodgers, and Peyton Manning, respectively. This plot matches very closely with the above season win totals boxplot. The teams with the best quarterbacks tend to have the most total wins at the end of a season.
I decided I wanted to capture 4 key domains of features that may influence a teams likelihood to win an upcoming game:
Note: While exploring the data and creating the
features I wanted to use in the model, I created a host of utility
helper functions that work off of the play-by-play output data from the
nflFastR package. Those helper functions can be found here.
I created an Elo rating system for each season to create a metric that keeps track of a teams rank relative to the rest of the league. Elo rating systems were first created to rate chess players and are now commonly used in many sports such as American Football, baseball, basketball, etc. Special thanks to the creators of the elo package, your package made my life a lot easier. Here is more information on Elo Rating Systems and its inventor Arpad Elo.
I calculated offensive and defensive statistics for every game a team played in the NFL from 1999 - 2021.
| Variable | Type | Description |
|---|---|---|
| Score Differential | Numeric | Average score differential throughtout the game |
| Third Down Conversion Rate | Numeric | Percent of 3rd downs converted by offense |
| Turnovers | Numeric | Total Turnovers by the offense (Fumbles lost + Interceptions |
| QB EPA | Numeric | Average Quarterback EPA across all plays in the game |
| Scoring Drive percentage | Numeric | Number of drives that results in a score (touchdown/fieldgoal) |
I replicated many of the same offensive statistics for each team’s defense to represent how opposing offenses performed against there defense.
| Variable | Type | Model Variable | Description |
|---|---|---|---|
| Opponent Points Scored in quarter 1, 2, 3, 4 | Numeric | def_qtr_pts_1,2,3,4 | Total Points allowed by defense at the end of each quarter |
| Opponent Third Down Conversion Rate | Numeric | def_third_down_pct | Percent of 3rd downs converted by opposing offenses |
| Opponent Turnovers | Numeric | def_turnovers | Total Turnovers created by defense (Fumbles lost + Interceptions |
| Opponent QB EPA | Numeric | def_qb_epa | Average Quarterback EPA allowed by defense across all plays in the game |
| Opponent Scoring Drive percentage | Numeric | def_score_drives_pct | Percent of Drives against the defense that resulted in points |
I included a handful of other information known prior to the game such as if the game is a home game, a division game, the number of days of rest between games, and the team’s overall, home, and away winning percentages, and the team’s ranking via an Elo rating.
| Variable | Type | Model Variable | Description |
|---|---|---|---|
| Home | Binary | home | Indicating if the team is the home team (1 = home, 0 = away |
| Division Game | Binary | div_game | Indicating if it is a division game (1 = division game, 0 = not division game |
| Rest days | Numeric | rest_days | Numeric variable for the amount of rest the team had between games |
| Win % | Numeric | win_pct | Percent of all games won |
| Home Win % | Numeric | home_win_pct | Percent of home games won |
| Away Win % | Numeric | away_win_pct | Percent of away games won |
| Elo rating | Numeric | elo | Elo rating per week in each season, relative to rest of the league |
Once all of the above variables were wrangled, cleaned, and aggregated from the NFL play-by-play data, I was left with a data frame that had 1 row for every game a team played during the season, with a column for each of the aforementioned derived variables.
Let’s take a look at how our game summary data relates to wins.
Below is a correlation matrix that highlights the relationship between variables in our data. We are most interested in the bottom row in this plot which indicates the correlation between a team winning there game and the other variables in our data set. Darker green colors indicate positive correlations while darker red colors indicate negative correlations.
We see that Elo rating has a relatively strong positive correlation with winning, which makes alot of sense, a higher Elo rating (higher ranked team) is more likely to end up with a win. The average score differential for a team also makes good sense, if a team on average has a higher score differential (they score more points than are scored against them) then it’s logical that they would end up winning more games. The turnover variable has a strong negative correlation with winning, so if a team on average has fewer turnovers, they are more likely to end up winning the game.
At this point in my analysis, I was working with a posteriori data, so data that was collected as a product of what happened on the field that week. If I want to make any useful predictions about the coming week, I would need to conjur up some a priori data.
To capture how well a team is doing throughout each season, I created lagged cumulative means for all of the above variables. Such that, for each week a team plays a game, the cumulative mean of all the preceeding weeks of data are calculated for every variable leading into the upcoming slate of games.
Below is the general method I used to get these lagged Cumulative averages values. Note, for some variables like winning percentage, the lagged cumulative averages was not calculated and rather only the lagged value was determined. In the example code below, the lagged winning percentage, and the lagged cumulative average QB EPA is calculated for the Arizona Cardinals 2014 season.
nfl %>%
dplyr::filter(season == 2014, team == "ARI") %>%
dplyr::select(season, week, team, win, win_pct, qb_epa) %>%
dplyr::group_by(season, team) %>%
dplyr::arrange(season, week, .by_group = T) %>%
dplyr::mutate(
across(c(win_pct), ~dplyr::lag(.x), .names = "{col}_lag"),
across(c(qb_epa), ~dplyr::lag(dplyr::cummean(.x)), .names = "{col}_lag")
) %>%
dplyr::mutate(across(c(win_pct:qb_epa_lag), round, 2)) %>%
dplyr::ungroup() %>%
dplyr::relocate(season, week, team, win, win_pct, win_pct_lag, qb_epa, qb_epa_lag)## # A tibble: 17 × 8
## season week team win win_pct win_pct_lag qb_epa qb_epa_lag
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 ARI 1 1 NA -0.1 NA
## 2 2014 2 ARI 1 1 1 0.03 -0.1
## 3 2014 3 ARI 1 1 1 0.2 -0.04
## 4 2014 5 ARI 0 0.75 1 -0.16 0.04
## 5 2014 6 ARI 1 0.8 0.75 -0.01 -0.01
## 6 2014 7 ARI 1 0.83 0.8 0.09 -0.01
## 7 2014 8 ARI 1 0.86 0.83 -0.03 0.01
## 8 2014 9 ARI 1 0.88 0.86 -0.02 0
## 9 2014 10 ARI 1 0.89 0.88 -0.11 0
## 10 2014 11 ARI 1 0.9 0.89 -0.06 -0.01
## 11 2014 12 ARI 0 0.82 0.9 -0.34 -0.02
## 12 2014 13 ARI 0 0.75 0.82 -0.09 -0.05
## 13 2014 14 ARI 1 0.77 0.75 0 -0.05
## 14 2014 15 ARI 1 0.79 0.77 -0.14 -0.05
## 15 2014 16 ARI 0 0.73 0.79 -0.25 -0.05
## 16 2014 17 ARI 0 0.69 0.73 0.09 -0.07
## 17 2014 18 ARI 0 0.65 0.69 -0.4 -0.06
If you take a look at ARI’s week 5 game (row 5), the win_pct_lag column for week 5 is represented by the teams win_pct from the previous week, week 4 in this case, or in other words, the team’s winning percentage leading into the week 5 match up. On the other hand, the qb_epa_lag column was calculated using the mean of qb_epa from weeks 1-4.
I did this lagging process across all of my variables such that for each row with a game outcome (win/loss), that team has variables representing their performances up until that point in the season.
Now our data is set up so that all the information used to inform our prediction is data that would be available to us prior to the upcoming week of games. This is important because information such as the number of turnovers during the game is not information we would have prior to the game, and thus can’t be used as a predictor in our model, we can only use historic data to inform our predictions.
Initially I had been running models with 2 data points for each game, one being the home team perspective and the other being the away team perspective. I pretty quickly ran into some major overfitting in my models and realized it was a better idea to only use data from the home team.
I decided I wanted to run my data across a panel of 6 different models and see how they all perform against each other
The first step was to split up our data into testing and training splits (75% training, 25% testing). When I split my data, I chose to stratify the data by the binary win/loss value to ensure there was the same proportion of wins and losses in our training and testing data splits.
# Set random seed
set.seed(234)
# Partition training/testing data, stratified by win/loss
nfl_split <- rsample::initial_split(nfl_df, strata = win)
# training data split - 75%
nfl_train <- rsample::training(nfl_split)
# testinng data split - 25%
nfl_test <- rsample::testing(nfl_split)Using the recipes package, I made ID variables for
information about the game and when it happened. I then applied some
recipe steps to my training data. Because I decided to only use the home
teams data, this meant that there was going to be more wins than losses
in my dataset because the home team tends to win more often then the
away team (56% wins / 44% losses). To account for the slight imbalance
of wins to losses in the data, I chose to use the themis
package to upsample my data using the themis::step_smote()
function.
step_zv is applied to all predictors to remove all
variables with only one variable (zero-variance)step_normalize is used to normalize all the numeric
predictors in the training data so that they all have a standard
deviation of 1 and a mean of 0step_smote is applied to our dependent
win variable to fix the class imbalance due to having
slightly more wins than losses in the data. The step_smote
function works by upsampling the minority class and generating new,
synthetic data by using the nearest-neighbors of the minority class
instances.Below I create the two recipe steps that we will apply to our training data before modeling.
# Base recipe
base_recipe <-
recipes::recipe(
formula = win ~ .,
data = nfl_train
) %>%
recipes::update_role(
game_id, team, opponent, season, week, new_role = "ID"
)
# Normalize, SMOTE algo upsampling recipe
norm_smote_recipe <-
base_recipe %>%
recipes::step_zv(recipes::all_predictors()) %>%
recipes::step_normalize(recipes::all_numeric_predictors())
themis::step_smote(win, over_ratio = 0.9, skip = T)
# zero variance SMOTE algo upsampling recipe
zv_smote_recipe <-
base_recipe %>%
recipes::step_zv(all_predictors()) %>%
themis::step_smote(win, over_ratio = 0.9, skip = T)I created 10 cross-validation folds from the training data. As I did with the initial training/testing split, I made sure to create stratified resamples using the win/loss variable to ensure the same proportion of wins and losses appear in CV folds as the do in the original data.
# Set random seed
set.seed(432)
# Cross-validation folds
nfl_folds <- rsample::vfold_cv(nfl_train, v = 10, strata = win)Using the workflowsets I created a
workflow_set object containing each of the data
preprocessing recipes and corresponding model specifications.
# Workflow set of candidate models
nfl_wfs <-
workflowsets::workflow_set(
preproc = list(
kknn_rec = norm_smote_recipe,
glmnet_rec = norm_smote_recipe,
xgboost_rec = zv_smote_recipe,
nnet_rec = norm_smote_recipe,
svm_poly_rec = norm_smote_recipe,
svm_rbf_rec = norm_smote_recipe
),
models = list(
knn = knn_spec,
glmnet = glmnet_spec,
xgboost = xgboost_spec,
nnet = nnet_spec,
svm_poly = svm_poly_spec,
svm_rbf = svm_spec
),
cross = F
)The next step was to tune our hyperparameters for each of models.
Using the tune package I applied the
tune::tune_grid() function across my workflowset object
containing my model recipes and specifications. 20 random candidate
parameters were created using the
dials::grid_latin_hypercube() function from the
dials package
This step is the most time consuming and resource intensive process of a machine learning workflow. So, to speed up the process I run the tuning process across multiple cores on my computer, making use of my computer’s multiple processors.
To get another boost in processing time, I made use of the
tune_race_anova from the finetune package,
which makes use of a repeated measure ANOVA model to iterativily
eliminates tuning parameters that are unlikely to yield the best
results. Combining parrallel processing and
racing methods can improve processing times by ~35
fold.
# Choose metrics
my_metrics <- yardstick::metric_set(roc_auc, pr_auc, accuracy, mn_log_loss)
# Set up parallelization, using computer's other cores
parallel::detectCores(logical = FALSE)
modeltime::parallel_start(6, .method = "parallel")
# Set Random seed
set.seed(589)
# Tune models in workflowset
nfl_wfs <-
nfl_wfs %>%
workflowsets::workflow_map(
"tune_grid",
resamples = nfl_folds ,
grid = 20,
metrics = my_metrics,
control = tune::control_grid(
verbose = TRUE,
save_pred = TRUE),
verbose = TRUE
)
# Stop parrallelization
modeltime::parallel_stop()
Now we can compare how all of our models did compared to one another and make a decision as to which one we want to use to make predictions. The plots below shows how the 6 models performed on the set of resampling data.
If you take a look at the ROC AUC and mean log loss plots the best performing models are the Logistic Regression (logistic_reg), the Support Vector Machines (svm_poly/svm_rbf), the Multilayer Perceptron (mlp) and the Gradient Boosted Decision Trees (boost_trees).
I decided to set aside the Multilayer Perceptron model due to the slight increase in mean log loss relative to the other top performing models.
In particular, the boosted trees looks to perform at approximately the same level asthe mlp model when it comes from correctly predicting wins (sensitivity) and correctly predicting losses (specificity), without the increase in log loss.
A standard threshold for log loss when it comes to binary predictions is ~0.69 because anything higher than that and you aren’t beating the probability of a 50-50 guess. The table below shows the mean log loss for each model. The best log loss you can have is 0 and the larger the number gets the worse the model is at predicting the correct outcome.
The table belows shows the performance metrics for how each model.
Resample Metrics | ||||||
model | rank | accuracy | mean_log_loss | roc_auc | sensitivity | specificity |
logistic_reg | 1 | 0.671 | 0.595 | 0.738 | 0.707 | 0.624 |
svm_poly | 2 | 0.670 | 0.597 | 0.736 | 0.704 | 0.626 |
svm_rbf | 3 | 0.669 | 0.600 | 0.733 | 0.703 | 0.624 |
mlp | 4 | 0.668 | 0.626 | 0.732 | 0.682 | 0.649 |
boost_tree | 6 | 0.663 | 0.603 | 0.729 | 0.684 | 0.635 |
nearest_neighbor | 12 | 0.622 | 0.798 | 0.661 | 0.629 | 0.612 |
resamp_metrics %>%
dplyr::arrange(mean_log_loss) %>%
ggplot() +
geom_point(aes(x = reorder(model, mean_log_loss), y = mean_log_loss), size = 2) +
geom_hline(yintercept = 0.69, size = 1.5) Looking at at variable importance plots, generated from the
vip package, can give us an idea of which variables are the
most important to ours models.
Below is the variable importance scores for the Logistic Regression and the XGBoost models.
The most important variables in both models looks to be the team’s Elo ratings, the cumulative average score differential and the opponent’s winning percentage.
I then selected the tuning parameters for each model that had the best performance metrics on the data resamples. Using these model parameters, I refit each model for the last time using just the initial data split, fitting one last time on the training data and evaluating on the testing data. I then used the final fitted models to make predictions. Let’s check out the results!
And in table form…
Best model metrics | ||||||
Model | Data split | accuracy | mean_log_loss | roc_auc | sensitivity | specificity |
nearest_neighbor | train | 0.69 | 0.57 | 0.77 | 0.68 | 0.69 |
nearest_neighbor | test | 0.63 | 0.85 | 0.67 | 0.63 | 0.63 |
logistic_reg | train | 0.67 | 0.59 | 0.74 | 0.70 | 0.63 |
logistic_reg | test | 0.69 | 0.58 | 0.76 | 0.72 | 0.65 |
boost_tree | train | 0.71 | 0.60 | 0.78 | 0.74 | 0.68 |
boost_tree | test | 0.69 | 0.61 | 0.75 | 0.72 | 0.65 |
mlp | train | 0.67 | 0.62 | 0.74 | 0.68 | 0.66 |
mlp | test | 0.69 | 0.62 | 0.76 | 0.70 | 0.68 |
svm_poly | train | 0.68 | 0.59 | 0.74 | 0.71 | 0.63 |
svm_poly | test | 0.68 | 0.58 | 0.76 | 0.71 | 0.64 |
svm_rbf | train | 0.68 | 0.58 | 0.75 | 0.72 | 0.63 |
svm_rbf | test | 0.69 | 0.58 | 0.76 | 0.72 | 0.65 |
The nearest neighbor model sees a big drop in mean log loss between the training and testing data, besides this, the remaining models appear to perform similarly with both the training and testing data, which gives us some more confidence that these models would perform similarly in the real world.
Receiver Operator Characteristic curves, (aka ROC curves) are a way of understanding how well a model is predicting the True Positive events vs. the False Positive events.
In our scenario, with models that are trying to predict if an NFL team will win, a True Positive event would be that the model predicts a win and the team actually wins, while in the False Positive event the model predicts a win, and the team actually loses.
The True Positive Rate (TPR) is on the Y axis and the False Positive Rate (FPR) is on the X axis.
The TPR, or sensitivity, along the Y axis indicates the probability that the model predicted a win and the team actually won. While FPR or, 1 - specificity, along the X axis indicates the probability that the model predicted a win and the team actually lost.
The closer the ROC curve is to the top left corner of the plot the better the model is doing at correctly classifying the data. A perfect ROC Curve would be a 90 degree angle along the top left corner of the plot and would indicate that the model is able to perfectly classify the data correctly.
roc_curve <-
roc_df %>%
ggplot2::ggplot() +
ggplot2::geom_abline(
lty = 2,
alpha = 0.7,
color = "black",
size = 1) +
ggplot2::geom_line(
ggplot2::aes(
x = 1 - specificity,
y = sensitivity,
color = model),
alpha = 0.6,
size = 1
) +
# ggplot2::scale_color_manual(values = RColorBrewer::brewer.pal(6, "Dark2")) +
ggplot2::scale_color_manual(values = c( "red3", "forestgreen", "dodgerblue3", "coral3", "cyan3", "darkorange")) +
ggplot2::coord_equal() +
ggplot2::labs(
title = "ROC Curves",
subtitle = "Test data",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)"
) +
apatheme +
ggplot2::theme(
axis.title.x = ggplot2::element_text(hjust=0.5, vjust = -2),
axis.title.y = ggplot2::element_text(hjust=0.5, vjust = 2)
)
# ggsave(
# here::here("img", "roc_curve.png"),
# roc_curve,
# width = 12,
# height = 10
# )No surprises here, using the test data all of the models are performing similarly, except for the K-nearest neighbors model, which is much worse at classifying the data then the rest of the models. This ROC curve just visually confirms what we saw in the model metric summary table. Based on these metrics, I am going to drop the K nearest neighbors model for now and focus on the rest of the models
A confusion matrix is a helpful way of understanding where a classification model is correct or incorrect in it’s predictions. At first glance, I thought the name confusion matrix came from the fact that I was confused. Turns out the name has more meaning, its purpose is to highlight where a classification model is confused in its predictions…not when the modeler is confused…
So, if this is your first time seeing a confusion matrix this illustration might help.
Insert hot dog confusion matrix image
A few notes from looking at this confusion matrix: - All three of my models do a better job of correctly predicting losses in the test data than Las Vegas did in the same games
The Support Vector Machine has more total correct predictions than both other models
The SVM RBF Model is the highest sensitivity model, it correctly predict wins better than the all other models as well as Las Vegas’s predictions
The Logistic Reg. Model predicts losses better than the SVM and XGBoost models
~35% of correct predictions are losses, the remaining ~65% are correctly predicted wins (~43% of the test data were actual losses so this isn’t too far off)
For this data/domain, having the model predict more wins than losses makes logical sense, the home team tends to win more often than they lose
In general, the models have the most trouble classifying FP
And here are the summary statistics for these confusion matrices
confmat_tbl %>%
flextable::flextable() %>%
flextable::add_header_row(
values = c("", "Metrics"),
colwidths = c(1, 5)
) %>%
flextable::theme_box() %>%
flextable::align(align = "center", part = "all")Metrics | |||||
Model | Accuracy | Misclassification | Precision | Sensitivity | Specificity |
Vegas | 0.67 | 0.33 | 0.79 | 0.68 | 0.64 |
Logisitc Reg. | 0.69 | 0.31 | 0.72 | 0.73 | 0.63 |
SVM RBF | 0.69 | 0.31 | 0.72 | 0.73 | 0.64 |
XGBoost | 0.69 | 0.31 | 0.71 | 0.73 | 0.63 |
MLP | 0.69 | 0.31 | 0.70 | 0.74 | 0.63 |
Looking at the SVM model in the table above, we see a sensitivity of ~0.72, which means the SVM model correctly predicted the home team to win 72% of the time. The specificity of 0.73 indicates that 73% of home team losses were correctly predicted. Among all the models, the SVM model looks to do the best at correctly identifying wins and losses for the home team.
If we look at the 1373 games in our test data, and generate predictions for those games, we can calculate the percent of games in each year which our model correctly predicted the outcome to.
Now let’s apply the fitted models to the 2021 season holdout data and see how those predictions fare against Vegas’ predicted winners.
We can visualize percent of games that Las Vegas got correct in 2021 and compare it to the ML models % of correct predictions. The X axis has the percent of correct predictions and the Y axis is the week of the NFL season.
Looks like our predictions are right in line with Vegas, in some weeks we do better than Vegas are predicting the correct game outcomes and some weeks we do worse.
The table below shows the actual outcomes of the 2021 season, the favored team according to Las Vegas, and the prediction’s from my fitted models. Green highlighted cells indicate that the prediction lined up with the actual game outcome, and red highlighted cells mean that the prediction was incorrect. The table is from the home team’s persepctive (i.e. a win in the actual_outcome column means the home team won).
First we’ll can check out one of the weeks that our models outperformed Vegas.
All of our models did better than Vegas in week 12,
# A week with more preditions correct than Las Vegas
good_week <-
pred_table %>%
dplyr::filter(week == 12)
# Color code TP and TN as green and FP and FN as red
good_week_color <- ifelse(good_week[, c(5, 6, 7, 8, 9)] == good_week[, c(4, 4, 4, 4, 4)], "#85E088", "indianred2")
# Flextable with color coding
good_week %>%
flextable::flextable() %>%
flextable::bg(j = c(5:9), bg = good_week_color) %>%
flextable::add_header_row(
values = c(" ", "Vegas", "Models"), # values = c(" ","Vegas Predictions", "Model Predictions"),
colwidths = c(4, 1, 4)
# values = c(" ", "Vegas / Model Predictions"), colwidths = c(4, 4)
) %>%
flextable::theme_box() %>%
flextable::align(align = "center", part = "all")
| Vegas | Models | ||||||
week | home_team | away_team | actual_outcome | vegas_pred | log_reg_pred | svm_pred | xgb_pred | mlp_pred |
12 | BAL | CLE | win | win | loss | loss | win | loss |
12 | CIN | PIT | win | win | win | win | win | win |
12 | DAL | LV | loss | win | win | win | win | win |
12 | DEN | LAC | win | loss | win | win | loss | win |
12 | DET | CHI | loss | loss | loss | loss | loss | loss |
12 | GB | LA | win | loss | loss | loss | win | loss |
12 | HOU | NYJ | loss | win | win | loss | loss | loss |
12 | IND | TB | loss | loss | loss | loss | loss | loss |
12 | JAX | ATL | loss | loss | loss | loss | loss | loss |
12 | MIA | CAR | win | loss | win | win | win | win |
12 | NE | TEN | win | win | win | win | win | win |
12 | NO | BUF | loss | loss | loss | loss | loss | loss |
12 | NYG | PHI | win | loss | loss | loss | loss | loss |
12 | SF | MIN | win | win | win | win | win | win |
12 | WAS | SEA | win | loss | win | win | win | win |
And now a bad week…
# A week with more preditions correct than Las Vegas
bad_week <-
pred_table %>%
dplyr::filter(week == 15)
# Color code TP and TN as green and FP and FN as red
bad_week_color <- ifelse(bad_week[, c(5, 6, 7, 8, 9)] == bad_week[, c(4, 4, 4, 4, 4)], "#85E088", "indianred2")
# Flextable with color coding
bad_week %>%
flextable::flextable() %>%
flextable::bg(j = c(5:9), bg = bad_week_color) %>%
flextable::add_header_row(
values = c(" ", "Vegas", "Models"),
colwidths = c(4, 1, 4)
) %>%
flextable::theme_box() %>%
flextable::align(align = "center", part = "all")
| Vegas | Models | ||||||
week | home_team | away_team | actual_outcome | vegas_pred | log_reg_pred | svm_pred | xgb_pred | mlp_pred |
15 | BAL | GB | loss | loss | loss | loss | loss | loss |
15 | BUF | CAR | win | win | win | win | win | win |
15 | CHI | MIN | loss | loss | loss | loss | loss | loss |
15 | CLE | LV | loss | loss | win | loss | win | win |
15 | DEN | CIN | loss | win | win | win | win | win |
15 | DET | ARI | win | loss | loss | loss | loss | loss |
15 | IND | NE | win | loss | win | win | win | win |
15 | JAX | HOU | loss | win | win | win | loss | loss |
15 | LA | SEA | win | win | loss | loss | win | win |
15 | LAC | KC | loss | loss | loss | loss | loss | loss |
15 | MIA | NYJ | win | win | win | win | win | win |
15 | NYG | DAL | loss | loss | loss | loss | loss | loss |
15 | PHI | WAS | win | win | win | win | win | win |
15 | PIT | TEN | win | win | loss | loss | loss | loss |
15 | SF | ATL | win | win | win | win | win | win |
15 | TB | NO | loss | win | win | win | win | win |
We did it, we implemented NFL data into a Machine learning workflow and generated some reasonably accurate predictions!
To recap what I did: - Started with raw NFL play-by-play data - Cleaned and processed the data into a model-ready format - Selected a handful of ML models, - trained and tested the models with our prepped data, - Split our data into training and testing datasets - Trained our models using the training data and evaulated them using 10 fold cross validation. - Determined optimal hyperparameters and used these to refit our models on the entire dataset - Generated predictions and compared our predictions to the predicted favorites according to Las Vegas
I’m not a huge sports bettor, but I really just wanted to make a model that could potentially make me money if I was.